Full Page Handwriting Recognition via Image to Sequence Extraction
We present a Neural Network-based Handwritten Text Recognition (HTR) model
architecture that can be trained to recognize full pages of handwritten or
printed text without image segmentation. Being based on an Image to Sequence
architecture, it can be trained to extract the text present in an image and
sequence it correctly without imposing any constraints on language, character
shape, or the orientation and layout of text and non-text. The model can also
be trained to generate auxiliary markup related to formatting, layout, and
content. We use a character-level token vocabulary, thereby supporting proper
nouns and terminology of any subject. The model achieves a new state of the
art in full-page recognition on the IAM dataset, and when evaluated on scans
of real-world handwritten free-form test answers (a dataset beset with curved
and slanted lines, drawings, tables, math, chemistry, and other symbols) it
outperforms all commercially available HTR APIs. It is deployed in production
as part of a commercial web application.
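As a rough illustration of the character-level vocabulary idea described above, a tokenizer can mix single-character tokens with a few multi-character markup tokens; proper nouns and rare terminology then need no dedicated vocabulary entries. This is a minimal sketch under our own assumptions (the class name and the markup token set are illustrative, not from the paper):

```python
# Minimal sketch of a character-level token vocabulary with auxiliary
# markup tokens (class name and markup tokens are illustrative assumptions).
class CharTokenizer:
    def __init__(self, corpus, markup=("<b>", "</b>", "<nl>")):
        chars = sorted(set("".join(corpus)))
        self.tokens = list(markup) + chars          # markup tokens first
        self.to_id = {t: i for i, t in enumerate(self.tokens)}

    def encode(self, text):
        # Greedily match markup tokens; otherwise fall back to single chars.
        ids, i = [], 0
        while i < len(text):
            for tok in self.tokens:
                if len(tok) > 1 and text.startswith(tok, i):
                    ids.append(self.to_id[tok])
                    i += len(tok)
                    break
            else:
                ids.append(self.to_id[text[i]])
                i += 1
        return ids

    def decode(self, ids):
        return "".join(self.tokens[i] for i in ids)
```

Any string over the corpus alphabet round-trips through `encode`/`decode`, and markup tokens such as `<nl>` come back intact as single vocabulary entries.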
Anytime Recognition of Objects and Scenes
Humans are capable of perceiving a scene at a glance, and obtain deeper understanding with additional time. Computer visual recognition should be similarly robust to varying computational budgets, a property we call Anytime recognition. We present a general method for learning dynamic policies to optimize Anytime performance in visual recognition. We approach this problem from the perspective of Markov Decision Processes, and use reinforcement learning techniques. Crucially, decisions are made at test time and depend on observed data and intermediate results. Our method is applicable to a wide variety of existing detectors and classifiers, as it learns from execution traces and requires no special knowledge of their implementation.

We first formulate a dynamic, closed-loop policy that infers the contents of the image in order to decide which single-class detector to deploy next. We explain effective decisions for reward function definition and state-space featurization, and evaluate our method on the PASCAL VOC dataset with a novel costliness measure, computed as the area under an Average Precision (AP) vs. time curve. In contrast to previous work, our method significantly diverges from predominant greedy strategies and learns to take actions with deferred values. If execution is stopped when only half the detectors have been run, our method obtains 66% better mean AP than a random ordering, and 14% better performance than an intelligent baseline.

The detection actions are costly relative to the inference performed in executing our policy. Next, we apply our approach to a setting with less costly actions: feature selection for linear classification. We explain strategies for dealing with unobserved feature values that are necessary to effectively classify from any state in the sequential process. We show the applicability of this system to a challenging synthetic problem and to benchmark problems in scene and object recognition. On suitable datasets, we can additionally incorporate a semantic back-off strategy that gives maximally specific predictions for a desired level of accuracy. Our method delivers the best results on the costliness measure, and provides a new view on the time course of human visual perception.

Traditional visual recognition obtains significant advantages from the use of many features in classification. Recently, however, a single feature learned with multi-layer convolutional networks (CNNs) has outperformed all other approaches on the main recognition datasets. We propose Anytime-motivated methods for speeding up CNN-based detection approaches while maintaining their high accuracy: (1) a dynamic region selection method using novel quick-to-compute features; and (2) the Cascade CNN, which adds a reject option between expensive convolutional layers and allows the network to terminate some computation early. On the PASCAL VOC dataset, we achieve an 8x speed-up while losing no more than 10% of the top detection performance.

Lastly, we address the problem of image style recognition, which has received little research attention despite the significant role of visual style in conveying meaning through images. We present two novel datasets: 80K Flickr photographs annotated with curated style labels, and 85K paintings annotated with style/genre labels. In preparation for Anytime recognition, we perform a thorough evaluation of different image features for image style prediction. We find that features learned in a multi-layer network perform best, even when trained with object category labels. Our large-scale learning method also results in the best published performance on an existing dataset of aesthetic ratings and photographic style annotations. We use the learned classifiers to extend traditional tag-based image search to consider stylistic constraints, and demonstrate cross-dataset understanding of style.
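The costliness measure above is the area under an AP-vs-time curve up to a budget. A small illustrative computation of such an area (the function name, the piecewise-linear interpolation, and the flat extension past the last measurement are our own assumptions, not details from the abstract) might look like:

```python
# Illustrative area under an AP-vs-time curve via the trapezoidal rule
# (interpolation and extension choices are assumptions for this sketch).
def ap_time_area(times, aps, budget):
    """Area under AP(t) from t=0 to t=budget, with AP assumed to start at
    0 and vary piecewise-linearly between measurements."""
    area, prev_t, prev_ap = 0.0, 0.0, 0.0
    for t, ap in zip(times, aps):
        t = min(t, budget)
        area += 0.5 * (prev_ap + ap) * (t - prev_t)  # trapezoid slice
        prev_t, prev_ap = t, ap
        if t >= budget:
            break
    # Hold the last AP flat out to the budget if measurements stop early.
    if prev_t < budget:
        area += prev_ap * (budget - prev_t)
    return area
```

For example, `ap_time_area([1.0, 2.0], [0.5, 0.5], 2.0)` combines a ramp-up triangle (0.25) with a flat segment (0.5), giving 0.75; a larger area means more AP delivered earlier within the budget.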
Dynamic Feature Selection for Classification on a Budget
The test-time efficient classification problem consists of:

• $N$ instances labeled with one of $K$ labels: $D = \{x_n \in \mathcal{X},\, y_n \in \mathcal{Y} = \{1, \dots, K\}\}_{n=1}^{N}$.
• $F$ features $\mathcal{H} = \{h_f : \mathcal{X} \to \mathbb{R}^{d_f}\}_{f=1}^{F}$, with associated costs $c_f$.
• A budget-sensitive loss $\mathcal{L}_B$, composed of a cost budget $B$ and a loss function $\ell(\hat{y}, y) \to \mathbb{R}$.

The goal is to find a feature selection policy $\pi(x) : \mathcal{X} \to 2^{\mathcal{H}}$ and a feature combination classifier $g(\mathcal{H}_\pi) : 2^{\mathcal{H}} \to \mathcal{Y}$ such that the total budget-sensitive loss $\sum_n \mathcal{L}_B(g(\pi(x_n)), y_n)$ is minimized. The cost of a selected feature subset $\mathcal{H}_{\pi(x)}$ is $C_{\mathcal{H}_{\pi(x)}}$. The budget-sensitive loss $\mathcal{L}_B$ imposes a hard budget constraint by only accepting answers with $C_{\mathcal{H}} \le B$. Additionally, $\mathcal{L}_B$ can be cost-sensitive: answers given at lower cost are more valuable than costlier answers. The motivation for the latter property is Anytime performance; we should be able to stop our algorithm's execution at any time and have the best possible answer.

Figure 1. Definition of the reward function. We seek to maximize the total area above the entropy vs. cost curve from $0$ to $B$, and so define the reward of an individual action as the area of the slice of the total area that it contributes. From state $s$, action $a = h_f$ leads to state $s'$ with cost $c_f$. The information gain of the action is $I_{\mathcal{H}_s}(Y; h_f) = H(Y \mid \mathcal{H}_s) - H(Y \mid \mathcal{H}_s \cup h_f)$, and its reward is $I_{\mathcal{H}_s}(Y; h_f)\,(B_s - \tfrac{1}{2} c_f)$, where $B_s$ is the budget remaining at state $s$.
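The reward defined in Figure 1 can be sketched numerically: the information gain is an entropy drop over the label distribution, weighted by the remaining budget minus half the feature's cost. In this sketch the label distributions passed in are toy assumptions; function names are ours:

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def reward(p_before, p_after, remaining_budget, cost):
    """Area of the slice an action contributes to the entropy-vs-cost
    curve: I(Y; h_f) * (B_s - c_f / 2)."""
    gain = entropy(p_before) - entropy(p_after)
    return gain * (remaining_budget - 0.5 * cost)
```

For instance, an action that resolves a uniform binary label distribution to certainty (gain of 1 bit) with cost 2 and remaining budget 4 earns a reward of 1 × (4 − 1) = 3; the same gain earned later, with less budget left, is worth less, which is exactly the Anytime incentive.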